Using VADER to perform simple sentiment analysis

Code
import pandas as pd

# Restore the DataFrame saved by an earlier notebook (IPython %store magic)
%store -r reddit_df

The compound score is a single, normalized number representing the overall sentiment of the text, ranging from -1 (most negative) to +1 (most positive). VADER calculates this by looking up each word’s sentiment (valence) in its dictionary and adjusting for things like capital letters (“GREAT”), punctuation (“!”), and modifiers (“very”). It then sums the adjusted scores and normalizes the sum to produce the final compound value.
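
The normalization step can be sketched in a few lines. This is a minimal illustration, not VADER's full pipeline: `normalize_valence` is a hypothetical helper name, the input is assumed to be the already-adjusted sum of word valences, and `alpha=15` is the default constant in the vaderSentiment reference implementation.

```python
import math

def normalize_valence(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip as a safeguard against floating-point overshoot
    return max(-1.0, min(1.0, score))

# Larger sums move toward -1 or +1 but never pass them
for total in (-6.0, -1.5, 0.0, 1.5, 6.0):
    print(f"{total:+.1f} -> {normalize_valence(total):+.4f}")
```

Because the curve flattens as the sum grows, long and strongly worded posts tend to score close to -1 or +1.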

Code
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()

Combine each post’s title and content into a single “full_text” column

Code
reddit_df["full_text"] = reddit_df["title"] + "\n" + reddit_df["content"]
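
One thing worth checking with this concatenation: in pandas, adding a string to a missing value gives a missing value, so a post with a title but no content would end up with no `full_text` at all. The sketch below uses a small made-up frame (not the Reddit data) to show the effect and one possible workaround with `fillna("")`. Whether the real data contains missing content is an assumption to verify.

```python
import pandas as pd

# Toy data for illustration only; the second post has no body text
toy = pd.DataFrame({
    "title": ["Feeling better today", "Need advice"],
    "content": ["Therapy is helping.", None],
})

# Plain concatenation: a missing content value makes the whole result missing
plain = toy["title"] + "\n" + toy["content"]

# Filling missing values with empty strings first keeps the title
filled = toy["title"].fillna("") + "\n" + toy["content"].fillna("")

print(plain.isna().tolist())
print(filled.tolist())
```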

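This function returns the VADER scores for a single text, falling back to all-zero scores when the text is missing or empty. Applying it to the “full_text” column scores every post in the dataframe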

Code
def get_sentiment_scores(text):
    if pd.isna(text) or text == '':
        return {'neg': 0, 'neu': 0, 'pos': 0, 'compound': 0}
    return senti_analyzer.polarity_scores(text)

sentiment_scores = reddit_df['full_text'].apply(get_sentiment_scores)

Store each VADER score (compound, positive, negative, neutral) in its own column

Code
reddit_df["compound"] = sentiment_scores.apply(lambda x: x["compound"])
reddit_df["positive"] = sentiment_scores.apply(lambda x: x["pos"])
reddit_df["negative"] = sentiment_scores.apply(lambda x: x["neg"])
reddit_df["neutral"] = sentiment_scores.apply(lambda x: x["neu"])
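
As an alternative to four separate `apply` calls, the dictionaries can be expanded into columns in one step. A minimal sketch, using made-up score dictionaries in place of real VADER output; `toy_scores` and `score_cols` are illustrative names.

```python
import pandas as pd

# Made-up scores standing in for the output of polarity_scores()
toy_scores = pd.Series([
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
    {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.5},
])

# Build a DataFrame from the list of dicts, keeping the original index
score_cols = pd.DataFrame(toy_scores.tolist(), index=toy_scores.index)

# Rename to the column names used in this notebook
score_cols = score_cols.rename(
    columns={"pos": "positive", "neg": "negative", "neu": "neutral"}
)
print(score_cols)
```

The result could then be attached to the main frame with something like `reddit_df.join(score_cols)`.
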
Code
# Extract the calendar year for the yearly aggregation (requires a datetime "date" column)
reddit_df["year"] = reddit_df["date"].dt.year

Assign each post to a COVID period: before March 2020, March 2020 through April 2023, or May 2023 onward

Code
def assign_covid_period(date):
    if date < pd.Timestamp("2020-03-01"):
        return "Pre-COVID"
    elif date < pd.Timestamp("2023-05-01"):
        return "During COVID"
    else:
        return "Post-COVID"
    
reddit_df["covid_period"] = reddit_df["date"].apply(assign_covid_period)
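
The same bucketing can also be done without a row-by-row `apply`, and the periods can be given an explicit order so that grouped output follows the timeline instead of the alphabet. A minimal sketch on toy dates, assuming NumPy is available; `covid_start`, `covid_end`, and `period_order` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy dates for illustration only, chosen to sit on the period boundaries
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-03-01", "2023-04-30", "2023-05-01"]),
    "compound": [-0.1, -0.2, -0.3, -0.4],
})

covid_start = pd.Timestamp("2020-03-01")
covid_end = pd.Timestamp("2023-05-01")

# np.select checks conditions in order, so the second one only applies
# to dates on or after covid_start
labels = np.select(
    [(toy["date"] < covid_start).to_numpy(), (toy["date"] < covid_end).to_numpy()],
    ["Pre-COVID", "During COVID"],
    default="Post-COVID",
)

# An ordered categorical makes groupby output follow the timeline
period_order = ["Pre-COVID", "During COVID", "Post-COVID"]
toy["covid_period"] = pd.Categorical(labels, categories=period_order, ordered=True)

print(toy.groupby("covid_period", observed=True)["compound"].mean())
```
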
Code
yearly_sentiment = reddit_df.groupby('year').agg({
    'compound': 'mean',
    'positive': 'mean',
    'negative': 'mean',
    'neutral': 'mean'
}).round(4)

print("Average Sentiment Scores by Year")
print(yearly_sentiment)
Average Sentiment Scores by Year
      compound  positive  negative  neutral
year                                       
2020   -0.1997    0.1113    0.1775   0.7112
2021   -0.1710    0.0979    0.1558   0.7463
2022   -0.1965    0.1066    0.1556   0.7378
2023   -0.2497    0.1114    0.1573   0.7313
2024   -0.2382    0.1056    0.1620   0.7324
2025   -0.2772    0.1026    0.1616   0.7358

Here, I’m grouping my sentiment data by year to see the long-term trends. I calculate the average for the compound, positive, negative, and neutral scores for each year.

The resulting table shows that the mean compound score is negative in every year. It is least negative in 2021 (-0.1710) and more negative in 2023 through 2025 (-0.2382 to -0.2772) than in 2020 through 2022 (-0.1710 to -0.1997). This suggests that the language in these subreddits has grown more negative in the post-pandemic period compared to during the pandemic itself.

Code
# Group by COVID period
period_statement = reddit_df.groupby('covid_period').agg({
    'compound': ['mean', 'std', 'count'],
    'negative': 'mean',
    'positive': 'mean'
}).round(4)

print("Sentiment by COVID Period")
print(period_statement)
Sentiment by COVID Period
             compound                negative positive
                 mean     std  count     mean     mean
covid_period                                          
During COVID  -0.1967  0.6350  10517   0.1624   0.1049
Post-COVID    -0.2485  0.7024   6218   0.1600   0.1073
Pre-COVID     -0.1463  0.6532    576   0.1747   0.1142

This code groups all my posts into the ‘Pre-COVID’, ‘During COVID’, and ‘Post-COVID’ periods. For each period, it calculates the average compound, negative, and positive scores. It also gets the standard deviation (std) and post count for the compound score, to show how widely scores vary and how much data each period has.

This output shows that the overall sentiment (compound score) is more negative during the pandemic (-0.1967) than in the ‘Pre-COVID’ period (-0.1463). That comparison is the weaker one, because the ‘Pre-COVID’ group has only 576 posts, all from January and February 2020, since the dataset starts in 2020. More importantly, the sentiment in the ‘Post-COVID’ period (-0.2485) is even more negative than it was during the pandemic. This suggests that the collective mental health discourse in these subreddits has not improved and has, in fact, trended further into negative territory.
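
Since the table reports a mean, standard deviation, and count for each period, a rough check of whether these gaps exceed sampling noise is possible from the summary numbers alone. The sketch below computes Welch's t statistic by hand; `welch_t` is a hypothetical helper, and it treats posts as independent samples, which is an assumption that Reddit data probably does not fully satisfy.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic for two groups given only summary statistics."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error

# (mean, std, count) of the compound score, copied from the table above
pre = (-0.1463, 0.6532, 576)
during = (-0.1967, 0.6350, 10517)
post = (-0.2485, 0.7024, 6218)

t_pre_vs_during = welch_t(*pre, *during)
t_during_vs_post = welch_t(*during, *post)

print(f"Pre vs During:  t = {t_pre_vs_during:.2f}")
print(f"During vs Post: t = {t_during_vs_post:.2f}")
```

With samples this large, an absolute t value above about 2 corresponds to roughly the 5% two-sided level. By that yardstick the During-to-Post drop (t near 4.8) stands out more clearly than the Pre-to-During drop (t near 1.8), which rests on only 576 Pre-COVID posts.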

Code
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

# Line plot of the yearly mean compound score
fig = px.line(
    yearly_sentiment.reset_index(),  # expose 'year' as a regular column
    x="year",
    y="compound",
    markers=True,  # draw a marker at each yearly point
    text="compound",  # label each point with its compound score
    title="Reddit Mental Health Sentiment Over Time",
    labels={
        "year": "Year",
        "compound": "Average Compound Sentiment Score",
    },
)

# Mark the start of COVID with a dashed vertical line
fig.add_vline(
    x=2020,
    line_dash="dash",
    line_color="red",
    annotation_text="COVID Start",
    annotation_position="top right",
)

# Position the point labels and show three decimal places
fig.update_traces(
    textposition="bottom right",
    texttemplate="%{text:.3f}",
)

# Set the figure size and lighten the grid lines
fig.update_layout(
    width=1000,
    height=600,
    xaxis_gridcolor="rgba(0,0,0,0.1)",
    yaxis_gridcolor="rgba(0,0,0,0.1)",
)

fig.show()

The data suggests that the pandemic’s effect on mental health discourse wasn’t a temporary event that ended when the lockdowns lifted. Instead, sentiment scores stayed negative and fell further afterward. One possible explanation is that the acute anxiety of the virus and lockdowns gave way to chronic stressors such as inflation, job instability, and the stress of “returning to normal”, a new set of anxieties that we’ll explore later in this dataset.

Furthermore, the prolonged social isolation may have caused lasting damage to social structures and individual well-being, leading to persistent loneliness and disconnection. Sentiment scores alone cannot establish a cause, but the data is consistent with the idea that we are now living with the compounded consequences of the pandemic, which may be proving just as detrimental to collective mental health as the initial crisis itself, or even more so.

Code
# Check data distribution
print("Posts per year:")
print(reddit_df['year'].value_counts().sort_index())

# Compare mean compound sentiment across the three COVID periods
print("\nAverage compound sentiment by period:")
for period in ['Pre-COVID', 'During COVID', 'Post-COVID']:
    data = reddit_df[reddit_df['covid_period'] == period]
    if len(data) > 0:
        print(f"{period}: {data['compound'].mean():.4f} (n={len(data)})")
Posts per year:
year
2020    3536
2021    3567
2022    3159
2023    2775
2024    2993
2025    1281
Name: count, dtype: int64

Average compound sentiment by period:
Pre-COVID: -0.1463 (n=576)
During COVID: -0.1967 (n=10517)
Post-COVID: -0.2485 (n=6218)

I have a solid collection of posts, with thousands from each year between 2020 and 2024. The lower number for 2025 just means I ran the data collection part-way through that year.

The average sentiment was already negative Pre-COVID (-0.1463) and became more negative During COVID (-0.1967), although the Pre-COVID group is small (n=576). Most importantly, instead of recovering, the sentiment has grown even more negative in the Post-COVID period (-0.2485), suggesting that the pandemic has had a lasting and worsening impact on the mental health discourse in these subreddits.

Code
reddit_sent_df = reddit_df
Code
%store reddit_sent_df
Stored 'reddit_sent_df' (DataFrame)